18 research outputs found
BabySLM: language-acquisition-friendly benchmark of self-supervised spoken language models
Self-supervised techniques for learning speech representations have been
shown to develop linguistic competence from exposure to speech without the need
for human labels. In order to fully realize the potential of these approaches
and further our understanding of how infants learn language, simulations must
closely emulate real-life situations by training on developmentally plausible
corpora and benchmarking against appropriate test sets. To this end, we propose
a language-acquisition-friendly benchmark to probe spoken language models at
the lexical and syntactic levels, both using a vocabulary compatible with
the typical language experiences of children. This paper introduces
the benchmark and summarizes a range of experiments showing its usefulness. In
addition, we highlight two exciting challenges that need to be addressed for
further progress: bridging the gap between text and speech, and between clean
speech and in-the-wild speech.
Comment: Proceedings of Interspeech 2023
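A lexical probe of the kind described above can be sketched as a real-word versus pseudo-word discrimination task, scored by accuracy against a 50% chance level. The scorer below is a hypothetical stand-in (a character-bigram model trained on a toy child-vocabulary list), not the benchmark's actual spoken language models:

```python
# Sketch of a lexical "spot-the-word" probe: the model should assign a
# higher score to a real word than to a matched pseudo-word.
from collections import Counter
import math

def train_bigram(words):
    """Train a Laplace-smoothed character-bigram scorer on a word list."""
    counts, ctx = Counter(), Counter()
    for w in words:
        s = f"^{w}$"  # word-boundary markers
        for a, b in zip(s, s[1:]):
            counts[(a, b)] += 1
            ctx[a] += 1
    def score(w):
        s = f"^{w}$"
        return sum(math.log((counts[(a, b)] + 1) / (ctx[a] + 27))
                   for a, b in zip(s, s[1:]))
    return score

# Toy child-directed vocabulary (illustrative, not the benchmark's corpus).
score = train_bigram(["dog", "cat", "ball", "milk", "baby", "book", "duck", "cup"])
pairs = [("dog", "dzg"), ("ball", "bxll"), ("milk", "mlkk")]
acc = sum(score(real) > score(pseudo) for real, pseudo in pairs) / len(pairs)
print(acc)
```

A model with no lexical knowledge would hover around 0.5 on such pairs; above-chance accuracy is what the benchmark measures.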
Brouhaha: multi-task training for voice activity detection, speech-to-noise ratio, and C50 room acoustics estimation
Most automatic speech processing systems are sensitive to the acoustic
environment, with degraded performance when applied to noisy or reverberant
speech. But how can one tell whether speech is noisy or reverberant? We propose
Brouhaha, a pipeline to simulate audio segments recorded in noisy and
reverberant conditions. We then use the simulated audio to jointly train the
Brouhaha model for voice activity detection, speech-to-noise ratio estimation,
and C50 room acoustics prediction. We show how the predicted SNR and C50 values
can be used to investigate and help diagnose errors made by automatic speech
processing tools (such as pyannote.audio for speaker diarization or OpenAI's
Whisper for automatic speech recognition). Both our pipeline and a pretrained
model are open source and shared with the speech community.
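The C50 value the model predicts is a standard room-acoustics clarity index: the ratio, in dB, of early (first 50 ms) to late energy in the room impulse response. As a rough illustration of what is being estimated (this is not the Brouhaha code, which predicts C50 directly from the speech signal):

```python
import numpy as np

def c50(impulse_response: np.ndarray, sample_rate: int) -> float:
    """Clarity index C50: ratio (in dB) of energy arriving before 50 ms
    to energy arriving after 50 ms in a room impulse response."""
    split = int(0.050 * sample_rate)
    early = np.sum(impulse_response[:split] ** 2)
    late = np.sum(impulse_response[split:] ** 2)
    return 10.0 * np.log10(early / late)

# Synthetic exponentially decaying impulse response (a crude room model,
# 100 ms amplitude decay constant, 1 s long).
sr = 16000
t = np.arange(sr) / sr
rir = np.exp(-t / 0.1)
print(round(c50(rir, sr), 2))  # C50 ≈ 2.35 dB for this synthetic decay
```

Higher C50 means clearer, less reverberant speech; heavily reverberant rooms yield low or negative values.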
ProsAudit, a prosodic benchmark for self-supervised speech models
We present ProsAudit, a benchmark in English to assess structural prosodic
knowledge in self-supervised learning (SSL) speech models. It consists of two
subtasks, their corresponding metrics, and an evaluation dataset. In the
protosyntax task, the model must correctly identify strong versus weak prosodic
boundaries. In the lexical task, the model needs to correctly distinguish
between pauses inserted between words and within words. We also provide human
evaluation scores on this benchmark. We evaluated a series of SSL models and
found that they all performed above chance on both tasks, even when
evaluated on an unseen language. However, non-native models performed
significantly worse than native ones on the lexical task, highlighting the
importance of lexical knowledge in this task. We also found a clear effect of
size, with models trained on more data performing better on both subtasks.
Comment: Accepted at Interspeech 2023. 4 pages + references, 1 figure
Vocal markers from sustained phonation in Huntington's Disease
Disease-modifying treatments are currently assessed in neurodegenerative
diseases. Huntington's Disease represents a unique opportunity to design
automatic sub-clinical markers, even in premanifest gene carriers. We
investigated phonatory impairments as potential clinical markers and propose
them for both diagnosis and gene carriers' follow-up. We used two sets of
features: Phonatory features and Modulation Power Spectrum Features. We found
that phonation alone is not sufficient for identifying the sub-clinical
disorders of premanifest gene carriers. According to our regression results,
Phonatory features are suitable for predicting clinical performance in
Huntington's Disease.
Comment: To appear at INTERSPEECH 2020. 1 page of supplementary material
appears only in the arxiv version. Code to replicate:
https://github.com/bootphon/sustained-phonation-features
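Phonatory feature sets for sustained vowels typically include perturbation measures such as jitter. As an illustrative sketch (not necessarily the paper's exact feature extraction), local jitter can be computed from the pitch periods extracted from a sustained phonation:

```python
def local_jitter(periods):
    """Local jitter: mean absolute difference between consecutive pitch
    periods, divided by the mean period. A standard phonatory measure of
    cycle-to-cycle frequency instability."""
    diffs = [abs(a - b) for a, b in zip(periods, periods[1:])]
    return (sum(diffs) / len(diffs)) / (sum(periods) / len(periods))

# Pitch periods in seconds from a sustained vowel (synthetic values here;
# in practice they come from a pitch tracker such as Praat's).
periods = [0.0100, 0.0102, 0.0099, 0.0101]
print(f"{local_jitter(periods):.2%}")  # ≈ 2.32%
```

Elevated jitter relative to matched controls is one way such features can signal phonatory impairment.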
A comparison study on patient-psychologist voice diarization
Conversations between a clinician and a patient, in natural conditions, are valuable sources of information for medical follow-up. The automatic analysis of these dialogues could help extract new language markers and speed up the clinicians' reports. Yet, it is not clear which model is the most efficient at detecting and identifying speaker turns, especially for individuals with speech disorders. Here, we propose a split of the data that allows a comparative evaluation of different diarization methods. We designed and trained end-to-end neural network architectures to tackle this task directly from the raw signal, and we evaluated each approach under the same metric. We also studied the effect of fine-tuning models to find the best performance. Experimental results are reported on naturalistic clinical conversations between psychologists and interviewees at different stages of Huntington's disease, displaying a large panel of speech disorders. We found that our best end-to-end model achieved 19.5% IER (identification error rate) on the test set, compared to 23.6% achieved by fine-tuning the X-vector architecture. Finally, we observed that we could extract clinical markers directly from the automatic systems, highlighting the clinical relevance of our methods.
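The IER figures above can be illustrated with a simplified frame-level computation. Real diarization scoring (e.g. in pyannote.metrics) operates on timed segments with collars and overlap handling, so this is only a sketch of the idea: misses, false alarms, and speaker confusions, normalized by the amount of reference speech.

```python
def frame_ier(reference, hypothesis):
    """Simplified frame-level identification error rate: frames where the
    hypothesis label disagrees with the reference (a miss, a false alarm,
    or a speaker confusion), divided by the number of reference speech
    frames. None marks non-speech frames."""
    errors = sum(r != h for r, h in zip(reference, hypothesis))
    speech = sum(r is not None for r in reference)
    return errors / speech

ref = ["A", "A", "A", None, "B", "B", "B", "B"]   # ground-truth speakers
hyp = ["A", "A", "B", None, None, "B", "B", "A"]  # system output
print(f"{frame_ier(ref, hyp):.1%}")  # 2 confusions + 1 miss over 7 speech frames -> 42.9%
```

Lower is better; the 19.5% vs 23.6% comparison in the abstract is this kind of error rate computed over the full test conversations.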
Speaker detection in the wild: Lessons learned from JSALT 2019
Submitted to ICASSP 2020. This paper presents the problems and solutions addressed at the JSALT workshop when using a single microphone for speaker detection in adverse scenarios. The main focus was to tackle a wide range of conditions, from meetings to speech in the wild. We describe the research threads we explored and a set of modules that proved successful in these scenarios. The ultimate goal was to explore speaker detection, but our first finding was that effective diarization improves detection, and omitting the diarization stage degrades performance. All the different configurations of our research agree on this fact and follow a main backbone that includes diarization as a previous stage. With this backbone, we analyzed the following problems: voice activity detection, how to deal with noisy signals, domain mismatch, how to improve the clustering, and the overall impact of the previous stages on final speaker detection. In this paper, we show partial results for speaker diarization to give a better understanding of the problem, and we present the final results for speaker detection.